Assignment 1

Part of a data scientist's job is to use intuition and insight to write algorithms and heuristics. A data scientist also creates mathematical models that make predictions from attributes of the data being examined.

In this assignment, you will use your knowledge and intuition about the Titanic and its passengers' attributes to predict whether a passenger survived or perished (later in the course we will use machine learning to make predictions).

Assignment Details

You are given a list of Titanic passengers and their associated information.

For this exercise, you need to write a simple heuristic that will use the passengers' gender to predict if that person survived the Titanic disaster.

Your prediction should be at least 78% accurate.

Here's a simple heuristic to start off:

1) If the passenger is female, your heuristic should assume that the passenger survived.

2) If the passenger is male, your heuristic should assume that the passenger did not survive.

Data Features:

  1. survival -> Survival (0 = No; 1 = Yes)
  2. pclass -> Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
  3. name -> Name
  4. sex -> female/male
  5. age -> Passenger Age
  6. sibsp -> Number of Siblings/Spouses Aboard
  7. parch -> Number of Parents/Children Aboard
  8. ticket -> Ticket Number
  9. fare -> Passenger Fare
  10. cabin -> Cabin
  11. embarked -> Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

Data taken from Kaggle


In [1]:
import numpy
import pandas

'''
Goal # 1
Write your prediction back into the "predictions" dictionary. The
key of the dictionary should be the passenger's id (which can be accessed
via passenger["PassengerId"]) and the associated value should be 1 if the
passenger survived or 0 otherwise.

For example, if a passenger is predicted to have survived:
passenger_id = passenger['PassengerId']
predictions[passenger_id] = 1

And if a passenger is predicted to have perished in the disaster:
passenger_id = passenger['PassengerId']
predictions[passenger_id] = 0

Goal # 2 
Calculate accuracy of your predictions
Accuracy = (TP + TN)/(TP + TN + FP + FN)

'''

# Read Titanic Data into a Data Frame
df = pandas.read_csv('/Users/annette/Desktop/IntroToDataScienceClass/Lesson1/Numpy and Pandas/TitanicData.csv')

In [10]:
# Goal #1

# Store predictions in a dictionary
predictions = {}

#You can access the gender of a passenger via passenger['Sex'].
#If the passenger is male, passenger['Sex'] will return a string "male".
#If the passenger is female, passenger['Sex'] will return a string "female".
for passenger_index, passenger in df.iterrows():
    passenger_id = passenger['PassengerId']
    if passenger['Sex'] == 'female':
        predictions[passenger_id] = 1
    else:
        predictions[passenger_id] = 0
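
The same gender rule can also be expressed without an explicit loop. The following is a minimal sketch, not part of the original assignment; it uses a small made-up DataFrame (the name `sample` and its four rows are assumptions) in place of TitanicData.csv.

```python
import pandas

# Tiny made-up sample; the notebook itself reads TitanicData.csv instead.
sample = pandas.DataFrame({
    'PassengerId': [1, 2, 3, 4],
    'Sex': ['male', 'female', 'female', 'male'],
})

# Boolean Series (True where the passenger is female), converted to 1/0.
predicted = (sample['Sex'] == 'female').astype(int)

# Build the same {PassengerId: prediction} mapping that the loop produces.
# tolist() converts NumPy integers to plain Python ints.
sample_predictions = dict(zip(sample['PassengerId'].tolist(), predicted.tolist()))
print(sample_predictions)
```

On a data set of this size the loop and the vectorized form are both fast; the vectorized form mainly saves typing.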

In [39]:
# Goal 2
TP = 0.0
TN = 0.0
FP = 0.0
FN = 0.0

for passenger_index, passenger in df.iterrows():
    passenger_id = passenger['PassengerId']
    
    # Implement Heuristic
    if passenger['Sex']=='female':
        predictions[passenger_id] = 1
        if passenger['Survived']==1:
            TP += 1
        else:
            FP += 1
    else:
        predictions[passenger_id] = 0
        if passenger['Survived']==0:
            TN += 1
        else:
            FN += 1
            
accuracy = (TP + TN)/(TP + TN + FP + FN)

# print() with parentheses works in both Python 2 and Python 3
print(accuracy * 100)


78.6756453423
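
To make the four counts concrete, here is a small worked sketch on made-up labels (the lists `actual` and `predicted` and both function names are hypothetical, not part of the assignment). It applies the same formula, Accuracy = (TP + TN)/(TP + TN + FP + FN).

```python
def confusion_counts(actual, predicted):
    """Return (TP, TN, FP, FN) for two equal-length sequences of 0/1 labels."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == 1 and a == 1:
            tp += 1      # predicted survived, did survive
        elif p == 0 and a == 0:
            tn += 1      # predicted perished, did perish
        elif p == 1 and a == 0:
            fp += 1      # predicted survived, actually perished
        else:
            fn += 1      # predicted perished, actually survived
    return tp, tn, fp, fn


def accuracy_from_counts(tp, tn, fp, fn):
    """Fraction of predictions that match the actual outcome."""
    return float(tp + tn) / (tp + tn + fp + fn)


# Five made-up passengers: three predictions match, two do not.
actual = [1, 0, 1, 0, 0]
predicted = [1, 0, 0, 1, 0]

tp, tn, fp, fn = confusion_counts(actual, predicted)
print(tp, tn, fp, fn)
print(accuracy_from_counts(tp, tn, fp, fn))
```

Here TP = 1, TN = 2, FP = 1 and FN = 1, so the accuracy is 3/5.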

In [49]:
# Re-initialize variables
TP = 0.0
TN = 0.0
FP = 0.0
FN = 0.0

# Loop through dictionary
for passengerID,predictionVal in predictions.items():
    
    # Get value at a specific location
    # Find the index for the current passenger and give me 
    # what is in element 'Survived' for that index.
    actual = df.at[df[(df['PassengerId']== passengerID)].index[0],'Survived']
    
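    if actual == 1:
        if actual == predictionVal:
            TP += 1
        else:
            # Survived but predicted to perish: false negative
            FN += 1
    else:
        if actual == predictionVal:
            TN += 1
        else:
            # Perished but predicted to survive: false positive
            FP += 1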
        
accuracy = (TP + TN)/(TP + TN + FP + FN)

# print() with parentheses works in both Python 2 and Python 3
print(accuracy * 100)


78.6756453423
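
The lookup in the loop above filters the whole DataFrame once per passenger. One possible alternative, sketched here on a made-up three-row frame (the names `sample` and `survived_by_id` are assumptions), is to index the 'Survived' column by 'PassengerId' once and then look values up by label.

```python
import pandas

# Made-up stand-in for the Titanic DataFrame.
sample = pandas.DataFrame({
    'PassengerId': [10, 11, 12],
    'Survived': [0, 1, 1],
})

# Series of actual outcomes, keyed by passenger id.
survived_by_id = sample.set_index('PassengerId')['Survived']

# Label-based lookup for a single passenger.
actual = int(survived_by_id.loc[11])
print(actual)
```

This assumes each PassengerId appears only once; with duplicate ids, `.loc` would return a Series rather than a single value.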

Can you do better?
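
One possible direction, purely as an illustration: combine gender with other features from the list above. The sketch below is a hypothetical rule, not a known solution; the age and class thresholds are arbitrary guesses, and the column names 'Age' and 'Pclass' are assumed to follow the same capitalization as 'Sex' and 'Survived'. Whether it beats the gender-only heuristic has to be checked against the data.

```python
def predict_survival(passenger):
    """Hypothetical refinement of the gender heuristic.

    `passenger` is any mapping with 'Sex', 'Age' and 'Pclass' keys,
    such as a row yielded by df.iterrows() or a plain dict.
    """
    if passenger['Sex'] == 'female':
        return 1
    # Arbitrary thresholds chosen for illustration only.
    # A missing age (NaN) compares as False, so the rule falls through to 0.
    if passenger['Age'] < 10 and passenger['Pclass'] < 3:
        return 1
    return 0


# Made-up passengers to show how the rule behaves.
print(predict_survival({'Sex': 'female', 'Age': 30.0, 'Pclass': 3}))
print(predict_survival({'Sex': 'male', 'Age': 5.0, 'Pclass': 2}))
print(predict_survival({'Sex': 'male', 'Age': 40.0, 'Pclass': 1}))
```

Inside the notebook's loop, the call would be `predictions[passenger_id] = predict_survival(passenger)`, followed by the same accuracy calculation as before.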


In [ ]: